The course learning objectives are:
Link to my GitHub repository for this course.
I prepared myself for this week's topic by completing the DataCamp exercise on “Regression and model validation”.
For the data wrangling part of this week's exercises I mostly used the methods learned in the DataCamp exercise. I learned the basics of data wrangling in R, and I know how to find more methods/tools and how to use them to wrangle data.
My source code used in the wrangling is here.
Basically, these are the main points of what is done in the code:
- The original dataset is explored with str and dim.
- learning2014 is created as a new empty data.frame to store the required variables.
- gender, age, attitude and points are simply copied from the original dataset to learning2014.
- The combined question variables are added to learning2014; the dplyr library is used here.
- Observations with points < 1 are filtered out of learning2014.
- The dataset (learning2014.txt) that the code created can be found from here.
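The select-and-filter steps can be sketched with dplyr on a toy data frame (the real script operates on the full survey data; the column "extra" below is a hypothetical stand-in for the many columns that are dropped):

```r
# Sketch of the wrangling steps: keep the needed columns, then drop
# observations with zero exam points. Toy data, not the real survey.
library(dplyr)

toy <- data.frame(
  gender = c("F", "M", "F"),
  age = c(25, 30, 22),
  attitude = c(37, 31, 25),
  points = c(25, 0, 24),
  extra = c(1, 2, 3)   # stands in for columns that are not kept
)

learning_toy <- toy %>%
  select(gender, age, attitude, points) %>%
  filter(points > 0)

str(learning_toy)
```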
With the dim function we learn the dimensions of the input dataset: the function returns the number of rows and columns.
dim(learning2014)
## [1] 166 7
With the str function we learn about the structure of the input dataset. Basically we get to know the number of observations (rows) and variables (columns) in the dataset. The output summary lists the variables, the type of each variable vector and a few example values from each vector.
str(learning2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
I use ggpairs to show a detailed graphical overview of the input dataset, just like it was done in the DataCamp course for this week's exercises. ggpairs creates a matrix of plots from the given dataset. With the aesthetics mapping we instruct the colors to be based on the gender variable, and the transparency is controlled with the value given to the alpha parameter. The lower parameter controls the type of plots in the lower section of the plot matrix. All plots are between pairs of variables in the learning2014 dataset, and each is categorized based on the gender variable. Above the diagonal of density plots are the correlation coefficients of the pairs, both as a whole and categorized by gender.
From the plots we can see that nearly 2/3 of the observations are from females and 1/3 from males. When we look at the single variables categorized by gender, we see that the variable attitude differs the most between the sexes. Females also tend to be younger than males based on the variable age. There are also differences between females and males in the combined question variables, where the surface question group differs the most and strategy to some extent.
When we look at the correlation coefficients we see that points and attitude have the strongest positive correlation (0.437) among the dataset variables, roughly equally high for both females and males. The combined question variables deep and surf have the strongest negative correlation (-0.324); among males the two disagree (-0.622) clearly more than among females (-0.087). Then there are several pairs with similar mild positive correlation, (age, strategic), (deep, attitude) and (points, strategic), and several pairs with similar mild negative correlation, (strategic, surface), (points, surface), (attitude, surface) and (age, surface).
ggpairs(learning2014,
mapping = aes(col = gender, alpha = 0.3),
lower = list(combo = wrap("facethist", bins = 20)))
With the summary command we can see statistics for each variable in the dataset learning2014. For the categorical variable gender we only get the number of observations for females and males. For each of the numerical variables we get the minimum, maximum, mean, median, and 1st and 3rd quartiles.
summary(learning2014)
## gender age attitude points deep
## F:110 Min. :17.00 Min. :14.00 Min. : 7.00 Min. :1.583
## M: 56 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00 1st Qu.:3.333
## Median :22.00 Median :32.00 Median :23.00 Median :3.667
## Mean :25.51 Mean :31.43 Mean :22.72 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75 3rd Qu.:4.083
## Max. :55.00 Max. :50.00 Max. :33.00 Max. :4.917
## surf stra
## Min. :1.583 Min. :1.250
## 1st Qu.:2.417 1st Qu.:2.625
## Median :2.833 Median :3.188
## Mean :2.787 Mean :3.121
## 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.333 Max. :5.000
From the summary below we can see the explanatory variables (attitude, age, stra) I used to build the model with the lm function. I reran the code several times with different combinations of variables, comparing the summaries of each model to find the best-fitting set. Residuals are the differences between the actual values of points and the predicted values of points. For most regressions the residuals should be normally distributed around zero, and here the residual distribution looks roughly symmetric (median 0.3303). There is a shorthand for the significance of each variable at the end of its line in the Coefficients section of the summary: the more stars at the end of the line, the more significant the variable, and a dot still indicates significance at the 0.1 level. Here we have *** for attitude and . for both stra and age. The p-value is the probability of observing an effect at least this large if the variable actually had no effect, so small values here are good. As we can see, we have a very low value for attitude and still fairly low values for both stra and age.
The R-squared value tells how close the data are to the fitted regression line. It indicates the percentage of the variability of the response data around its mean that the model explains: 0% when none, 100% when all of the variability is explained. Here we have about 21%. In general, the higher the R-squared, the better the model fits the data, but a low R-squared value is not always bad. For example, when trying to predict human behavior, R-squared values are typically lower than 50%, and here we have a model fitted to variables describing mostly (results of) human behavior. Also, if R-squared is low but the variables are statistically significant, important conclusions can still be drawn. That is the case here: a low R-squared with statistically significant variables.
##
## Call:
## lm(formula = points ~ attitude + stra + age, data = learning2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.1149 -3.2003 0.3303 3.4129 10.7599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.89543 2.64834 4.114 6.17e-05 ***
## attitude 0.34808 0.05622 6.191 4.72e-09 ***
## stra 1.00371 0.53434 1.878 0.0621 .
## age -0.08822 0.05302 -1.664 0.0981 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared: 0.2182, Adjusted R-squared: 0.2037
## F-statistic: 15.07 on 3 and 162 DF, p-value: 1.07e-08
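The summary above comes from a call of the shape shown in its Call: line. As a self-contained sketch (simulated stand-in data, since the learning2014 file is not bundled here):

```r
# Sketch: fit a linear model with lm() and inspect it with summary().
# The diary's actual call is lm(points ~ attitude + stra + age, data = learning2014);
# the toy data below is simulated for illustration only.
set.seed(1)
n <- 100
toy <- data.frame(
  attitude = rnorm(n, mean = 31, sd = 7),
  stra = rnorm(n, mean = 3.1, sd = 0.8),
  age = rnorm(n, mean = 25, sd = 7)
)
toy$points <- 11 + 0.35 * toy$attitude + 1.0 * toy$stra -
  0.09 * toy$age + rnorm(n, sd = 5)

my_model <- lm(points ~ attitude + stra + age, data = toy)
summary(my_model)  # residual distribution, coefficients, R-squared
```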
Model assumptions describe the data generating process; the better the assumptions fit reality, the better the model describes the phenomenon of interest. With linear regression, linearity is assumed: the target is modelled as a linear combination of the model parameters. It is usually also assumed that the errors are normally distributed, uncorrelated and of constant variance, and that the size of a given error does not depend on the explanatory variables.
With a QQ-plot of the residuals we can explore the assumption that the errors of the model are normally distributed. The 2nd plot is the QQ-plot of the model, and we can see that the majority of the residuals fit the line well, supporting the normality assumption.
The constant variance assumption can be explored with a simple scatter plot of residuals versus model predictions. Any pattern in the scatter plot implies a problem with the assumption. Here the scatter plot is the 1st plot, and I must say it is a bit difficult to tell whether there is a pattern (a problem), especially at the lowest and highest fitted values. In between, the residuals are well scattered.
Leverage measures how much impact a single observation has on the model, and the residuals vs. leverage plot helps identify observations with high impact. Here the last plot is the residuals vs. leverage plot. Again it is a little hard for me to tell, but my interpretation is that there is no single outstanding observation that would drag the model.
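All three diagnostic plots discussed above can be produced directly from a fitted lm object; which = c(1, 2, 5) selects Residuals vs Fitted, Normal Q-Q and Residuals vs Leverage (a toy model stands in for the real one):

```r
# Diagnostic plots from plot.lm(): 1 = Residuals vs Fitted,
# 2 = Normal Q-Q, 5 = Residuals vs Leverage.
set.seed(2)
toy <- data.frame(x = rnorm(50))
toy$y <- 2 * toy$x + rnorm(50)
m <- lm(y ~ x, data = toy)

par(mfrow = c(2, 2))        # show the plots in one grid
plot(m, which = c(1, 2, 5))
```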
I prepared myself for this week's topic by completing the DataCamp exercise on “Logistic Regression”.
As I did last week, for the data wrangling part of this week's exercises I mostly used the methods learned in the DataCamp exercise. I learned about joining datasets and doing data mutations.
My source code used in the wrangling is here.
Basically, these are the main points of what is done in the code:
- The two datasets are joined using the variables school, sex, age, address, famsize, Pstatus, Medu, Fedu, Mjob, Fjob, reason, nursery and internet.
- The alc_use variable was generated to indicate alcohol use (average of daily and weekly consumption).
- The high_use variable was generated to indicate high use of alcohol (true if alc_use > 2).
- The dataset (alc.csv) that the code created can be found from here.
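The two derived variables can be sketched with a dplyr mutate on a toy data frame (Dalc and Walc are the workday and weekend consumption columns of the real data):

```r
# Sketch: alc_use is the average of workday (Dalc) and weekend (Walc)
# consumption; high_use is TRUE when alc_use > 2. Toy rows for illustration.
library(dplyr)

toy <- data.frame(Dalc = c(1, 2, 4), Walc = c(1, 3, 5))

toy <- toy %>%
  mutate(alc_use = (Dalc + Walc) / 2,
         high_use = alc_use > 2)

toy
```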
Step one was to create chapter3.Rmd to inject this content into the diary as a child document of index.Rmd.
The combined dataset contains data from two Portuguese schools and was originally gathered from school reports and questionnaires. The data was initially collected in two datasets, one regarding student performance in mathematics and one regarding student performance in Portuguese language. In addition to the section Data Wrangling, in which the join criteria and the additionally created variables are discussed, you can find more information about the variables here.
Dimensions and the column (variable) names of the combined dataset:
dim(alc)
## [1] 382 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
Column (variable) types of the combined dataset:
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob <fct> teacher, other, other, services, other, other, othe...
## $ reason <fct> course, course, other, home, home, reputation, home...
## $ nursery <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures <int> 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, ...
## $ G1 <int> 2, 7, 10, 14, 8, 14, 12, 8, 16, 13, 12, 10, 13, 11,...
## $ G2 <int> 8, 8, 10, 14, 12, 14, 12, 9, 17, 14, 11, 12, 14, 11...
## $ G3 <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 1...
## $ alc_use <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...
My hypothesis to study is how time consumption is related to alcohol consumption, the idea being that if you spend your time on activities you do not have time to consume alcohol, and vice versa. My chosen variables are, in a sense, time-consuming variables: freetime, goout, studytime and traveltime.
- traveltime - home to school travel time. My intuition is that this would be a factor decreasing alcohol consumption.
- studytime - weekly study time. My intuition is that this would be a factor decreasing alcohol consumption.
- freetime - free time after school. My intuition is that this would be a factor increasing alcohol consumption.
- goout - going out with friends. My intuition is that this would be a factor increasing alcohol consumption.

Here I use ggpairs, familiar from last week, to visualize the chosen variables together with the variable high_use. From the diagram we quickly get a good overview of the relations of the variables between themselves and with high_use. Looking at the correlation coefficients we can observe that there is not much correlation between the chosen variables. We could say that freetime and goout mildly correlate positively, meaning that you go out with friends during your free time, which makes common sense. Also traveltime and studytime mildly correlate negatively, as do freetime and studytime, meaning that travel time between school and home, as well as free time, is not used for studying. My interpretation is that the low correlation between the chosen variables could indicate that the variables do not interfere with each other and could each independently be a factor increasing or decreasing alcohol consumption. That can be seen as being in favor of my hypothesis.
From the density plots we can see that there are some differences for each of the chosen variables when comparing high and low alcohol use. The most distinctive difference is between goout and high_use, where we could say that those who go out more have a tendency toward high alcohol consumption. Comparing the freetime and high_use densities, one could argue that with less free time there are fewer observations of high alcohol consumption, and that these grow with more free time. A similar argument can be made when comparing studytime and high_use, where there is less high alcohol consumption when more time is spent on studies. Also, comparing traveltime and high_use, one could say that if you live close to school you have more time to consume alcohol. In my interpretation these still support my hypothesis, but it is not so clear-cut because the densities for low and high alcohol consumption per chosen variable are quite similar: there are 1/3 high-consuming versus 2/3 low-consuming observations, and the values of the attributes are discrete and few.
Looking at the box and bar plots of each chosen variable, divided by alcohol consumption, we can see that traveltime and freetime are quite identically distributed for both high and low alcohol consumption. Looking at high alcohol consumption with freetime, it seems that high alcohol consumption increases more when freetime increases, whereas low alcohol consumption increases less when freetime increases.
With studytime and goout the distributions differ more between high and low alcohol consumption. Looking at high alcohol consumption with studytime, high alcohol consumption increases more when studytime decreases, compared to low alcohol consumption. Likewise, high alcohol consumption increases more when time going out with friends increases, compared to low alcohol consumption.
Here we use logistic regression to statistically explore the relationship between the chosen variables and the binary high/low alcohol consumption variable.
##
## Call:
## glm(formula = high_use ~ studytime + goout + traveltime + freetime,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5805 -0.7670 -0.5270 0.8878 2.4724
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.9531 0.7013 -4.211 2.54e-05 ***
## studytime -0.5612 0.1655 -3.391 0.000697 ***
## goout 0.7206 0.1212 5.947 2.73e-09 ***
## traveltime 0.3619 0.1752 2.066 0.038868 *
## freetime 0.0921 0.1354 0.680 0.496389
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 396.18 on 377 degrees of freedom
## AIC: 406.18
##
## Number of Fisher Scoring iterations: 4
From the summary we can see the distribution of the deviance residuals for the individual cases used in the model. The table of coefficients shows the coefficient and its standard error for each variable. The coefficients tell the change in the log-odds of the outcome for a one unit increase in the predictor variable (e.g. a one unit change in goout increases the log-odds of high_use by 0.7206). The p-values give, for each variable, the probability of observing an effect at least this large if the variable actually had no effect (so a low value here is good). From the significance indicators we can tell that studytime, goout and traveltime are statistically significant, but freetime is not.
From the odds ratios of the coefficients below we can see how each of the chosen variables is associated with high alcohol consumption (with high_use defined as binary, TRUE/FALSE). Since both the odds ratio and the confidence interval of studytime are below 1, it is associated with low alcohol consumption. Both goout and traveltime are above 1, so they are associated with high alcohol consumption. Whereas the confidence interval of freetime includes the value 1, meaning we cannot tell from this data in which direction it is associated.
## OR 2.5 % 97.5 %
## (Intercept) 0.0521788 0.01274364 0.2006731
## studytime 0.5705061 0.40821453 0.7826192
## goout 2.0556712 1.63108393 2.6258521
## traveltime 1.4359821 1.01912982 2.0303214
## freetime 1.0964726 0.84079237 1.4317362
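The odds-ratio table above can be computed from a fitted glm by exponentiating the coefficients and their confidence intervals. A self-contained sketch on simulated data (the real model is fitted on alc):

```r
# Sketch: odds ratios = exp(coefficients); CIs from profile likelihood,
# also exponentiated onto the odds-ratio scale. Simulated stand-in data.
set.seed(3)
n <- 200
toy <- data.frame(goout = sample(1:5, n, replace = TRUE),
                  studytime = sample(1:4, n, replace = TRUE))
toy$high_use <- rbinom(n, 1, plogis(-2 + 0.7 * toy$goout - 0.5 * toy$studytime))

m <- glm(high_use ~ goout + studytime, data = toy, family = "binomial")

OR <- exp(coef(m))     # odds ratios
CI <- exp(confint(m))  # 95 % confidence intervals
cbind(OR, CI)
```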
Based on the summary and odds ratios I conclude that studytime, goout and traveltime support my hypothesis but, surprisingly, freetime does not. And since freetime is not statistically significant, I will drop that variable from the further parts of this exercise.
First we need to re-fit the model without the statistically non-significant freetime variable.
##
## Call:
## glm(formula = high_use ~ studytime + goout + traveltime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5318 -0.7679 -0.5214 0.8604 2.4526
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.6967 0.5878 -4.588 4.47e-06 ***
## studytime -0.5716 0.1650 -3.464 0.000532 ***
## goout 0.7431 0.1172 6.343 2.25e-10 ***
## traveltime 0.3559 0.1746 2.039 0.041486 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 396.64 on 378 degrees of freedom
## AIC: 404.64
##
## Number of Fisher Scoring iterations: 4
Then we use the model to predict probabilities for the data that was used to train the model. Based on the probabilities we make predictions for the training data observations, so that probability > 0.5 means high alcohol consumption, otherwise low. In the first table below we can see the counts of the true values of high_use against the predictions, and in the second table the proportions of the same. From these we can compute that the model gets it right (0.63874346 + 0.11780105 = 0.75654451) ~76% of the time, meaning a training error of ~24%. The model gets ~91% of the true low alcohol consumption observations correct, but only ~39% of the true high alcohol consumption observations.
## prediction
## high_use FALSE TRUE Sum
## FALSE 244 24 268
## TRUE 69 45 114
## Sum 313 69 382
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.63874346 0.06282723 0.70157068
## TRUE 0.18062827 0.11780105 0.29842932
## Sum 0.81937173 0.18062827 1.00000000
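The tables above follow the usual pattern: predict probabilities on the training data, classify at the 0.5 cutoff and cross-tabulate against the truth. A sketch on simulated data:

```r
# Sketch: probabilities -> 0.5-cutoff predictions -> confusion tables.
set.seed(4)
n <- 200
toy <- data.frame(goout = sample(1:5, n, replace = TRUE))
toy$high_use <- rbinom(n, 1, plogis(-2 + 0.7 * toy$goout)) == 1

m <- glm(high_use ~ goout, data = toy, family = "binomial")

toy$probability <- predict(m, type = "response")  # fitted probabilities
toy$prediction <- toy$probability > 0.5           # classify at 0.5

tab <- table(high_use = toy$high_use, prediction = toy$prediction)
addmargins(tab)               # counts with marginal sums
addmargins(prop.table(tab))   # the same as proportions
```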
As a performance comparison, the model beats a simple guessing strategy with weights 2/3 for low alcohol consumption and 1/3 for high. The results of the simple guessing strategy are below: it gets ~59% of the totals correct, ~70% of the true low alcohol consumption and ~32% of the true high alcohol consumption observations. (Note: since the guessing strategy is based on a randomizer, the percentages above may change slightly each time this R Markdown code is rerun.)
## prediction
## high_use 0 1 Sum
## FALSE 187 81 268
## TRUE 77 37 114
## Sum 264 118 382
## prediction
## high_use 0 1 Sum
## FALSE 0.48952880 0.21204188 0.70157068
## TRUE 0.20157068 0.09685864 0.29842932
## Sum 0.69109948 0.30890052 1.00000000
Here is also a graphical visualization of the actual and predicted values.
For the 10-fold cross-validation we need to define a loss function. I use the same definition of the loss function as was used in the DataCamp exercises. Here is the error rate of the model trained above, computed with the loss function.
## [1] 0.2434555
And here is the error rate of the trained model with 10-fold cross-validation. As you can see, the model slightly outperforms the model used in the DataCamp exercise.
## [1] 0.2539267
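The loss function and the cross-validation step can be sketched like this; loss_func is the DataCamp-style mean prediction error, and boot::cv.glm does the 10-fold refitting (simulated data as a stand-in for alc):

```r
# Sketch: share of wrong classifications as the loss, then 10-fold CV.
library(boot)

loss_func <- function(class, prob) {
  mean(abs(class - prob) > 0.5)  # wrong whenever |truth - probability| > 0.5
}

set.seed(5)
n <- 200
toy <- data.frame(goout = sample(1:5, n, replace = TRUE))
toy$high_use <- rbinom(n, 1, plogis(-2 + 0.7 * toy$goout))

m <- glm(high_use ~ goout, data = toy, family = "binomial")

# training error
train_error <- loss_func(class = toy$high_use,
                         prob = predict(m, type = "response"))

# 10-fold CV error (cv.glm refits the model on each fold)
cv <- cv.glm(data = toy, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
```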
In this part of the exercise set I used only the variables that are of numeric type (type: int); there are 14 such variables. I use a for loop to add predictors one by one and do the cross-validation step in between to store the error. Unfortunately my plotting skills are poor and I was not able to flip the value labels of the x-axis (I was able to flip the “curve” but decided not to, since it would make things more confusing), hence the errors are presented in the plot in order of increasing number of predictors.
## [1] 0.2958115 0.2958115 0.2958115 0.2879581 0.2879581 0.2879581 0.2879581
## [8] 0.2958115 0.2565445 0.2670157 0.2539267 0.2513089 0.2486911 0.2382199
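The loop behind those numbers can be sketched as follows: predictors are added one by one, the model is refitted and the 10-fold CV error stored after each addition (toy data with three hypothetical predictors):

```r
# Sketch: forward addition of predictors, with a CV error stored per step.
library(boot)

loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)

set.seed(6)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.8 * toy$x1 - 0.5 * toy$x2))

predictors <- c("x1", "x2", "x3")
errors <- numeric(length(predictors))

for (i in seq_along(predictors)) {
  f <- reformulate(predictors[1:i], response = "y")  # y ~ x1 + ... + xi
  m <- glm(f, data = toy, family = "binomial")
  errors[i] <- cv.glm(data = toy, cost = loss_func,
                      glmfit = m, K = 10)$delta[1]
}

errors  # CV error after each added predictor
```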
Also for this week I prepared myself by completing the DataCamp exercise on “Clustering and classification”.
Step one again was to create chapter4.Rmd to inject its content into the actual course diary as a child document of index.Rmd.
In this exercise we are using the Boston dataset from the MASS package for R. The dataset contains “Housing Values in Suburbs of Boston” according to the dataset description found here, where you can also find details about the dataset columns.
The Boston dataset is of class data.frame. The dimensions of the dataset are 506 observations (rows) and 14 variables (columns).
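Loading the dataset and checking those facts:

```r
# Boston ships with the MASS package.
library(MASS)
data("Boston")

class(Boston)  # "data.frame"
dim(Boston)    # 506 rows, 14 columns
```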
Structure of the dataset is listed below.
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
First, with a summary, together with the additional information available here, we can take a look at the different scales of the variables. Different rates, ratios, proportions, averages, means, medians and a dummy (boolean indicator) are used as values for the different variables.
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
Second, we can look at the density plots of each variable. I was tempted to draw conclusions about relationships already based on these diagrams, but as we can see from the correlation matrix (coming up next), things are not so obvious.
Third, with a correlation matrix we can observe the relationships between the variables. I used the mixed version of corrplot which was introduced in the DataCamp exercises. From it you can visually locate the variables with positive (→ 1) and negative (→ -1) relations in the top triangle and check their coefficients in the lower triangle. We can observe that the variable pair (rad, tax) has the highest positive correlation (> 0.9). The pairs (indus,nox), (indus,tax), (nox,age) and (rm,medv) also have a high positive correlation (> 0.7). Among the strong negative correlations (< -0.7) we can mention the pairs (indus,dis), (nox,dis), (age,dis) and (lstat,medv). Also noteworthy is that the variable chas does not seem to correlate positively or negatively with any other variable in the dataset.
After scaling, each variable has mean 0 (or very close) and variance 1: scaling subtracts the column mean from each value and divides by the column standard deviation. Note that this standardizes the variables but does not by itself make their distributions normal. The effect can be observed from the summary and the density plots after it (compare to the similar density plots above).
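The scaling step itself is the standard scale() call (it returns a matrix, hence the conversion back to a data frame):

```r
# Standardize every column: (x - mean(x)) / sd(x).
library(MASS)
data("Boston")

boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)  # every mean is now 0
```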
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Then, based on the quantiles of the scaled crime rate variable crim, a categorical variable is created. The original variable crim is removed and the newly created variable crime is added to the data set. Here is the structure of the modified dataset.
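A sketch of that step, assuming the DataCamp approach of cutting at the quantile breaks:

```r
# Cut scaled crim into four classes at its quantiles, then swap the
# continuous variable for the categorical one.
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

bins <- quantile(boston_scaled$crim)
boston_scaled$crime <- cut(boston_scaled$crim, breaks = bins,
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL  # remove the original variable

str(boston_scaled$crime)
```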
## 'data.frame': 506 obs. of 14 variables:
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ crime : Factor w/ 4 levels "low","med_low",..: 1 1 1 1 1 1 2 2 2 2 ...
The dataset is split into two, so that 80% of the data is in the train set and 20% in the test set.
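A sketch of the split (the seed is an assumption for reproducibility only; the diary's own split may differ):

```r
# 80/20 split by sampling row indices without replacement.
library(MASS)
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))

set.seed(123)  # assumption: any seed works
n <- nrow(boston_scaled)
ind <- sample(n, size = floor(n * 0.8))

train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]

dim(train)  # 404 rows
dim(test)   # 102 rows
```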
## [1] 404 14
## [1] 102 14
Here the LDA model is fitted with the train set and the model summary is printed. We can see that LD1 covers more than 0.95 of the between-group variance. Next the LDA is plotted as a biplot; I used just one discriminant pair. At first there was something going on with the arrows: they did not work on the same scale as in DataCamp, and in order to get even some of the arrows visible I needed to scale way more (the number of discriminants did not have any effect on this property).
!!Tue 27th!! Found the issue: not the arrows, but the wrong data being plotted :(
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2475248 0.2376238 0.2623762 0.2524752
##
## Group means:
## zn indus chas nox rm
## low 0.93520882 -0.8907091 -0.11484506 -0.8461428 0.4427384
## med_low -0.08995672 -0.2868547 -0.02626030 -0.5436521 -0.1974597
## med_high -0.38045185 0.1548430 0.17338039 0.3563715 0.1188784
## high -0.48724019 1.0171096 -0.04073494 1.0719641 -0.4264826
## age dis rad tax ptratio
## low -0.8504039 0.8282951 -0.6959029 -0.7237406 -0.44783279
## med_low -0.3196882 0.3639736 -0.5488034 -0.4678598 -0.02228321
## med_high 0.3883868 -0.3655824 -0.3870520 -0.2863228 -0.29451506
## high 0.7836698 -0.8384347 1.6382099 1.5141140 0.78087177
## black lstat medv
## low 0.37713478 -0.75166181 0.49724279
## med_low 0.30962322 -0.09439842 -0.04649268
## med_high 0.08005809 -0.02617694 0.19348023
## high -0.69571287 0.87641107 -0.65573310
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.12516906 0.86966092 -0.94732196
## indus -0.02273021 -0.17179618 0.18549215
## chas -0.08212910 -0.07022140 0.07547652
## nox 0.42388257 -0.65600866 -1.21121635
## rm -0.11735441 -0.02662857 -0.24806326
## age 0.30134507 -0.32963416 -0.16565107
## dis -0.05014585 -0.37462365 0.32242714
## rad 2.98404790 1.05684898 -0.14161191
## tax 0.03103018 -0.26536574 0.64453024
## ptratio 0.11510171 0.05685743 -0.19787389
## black -0.13853306 0.04292244 0.11162022
## lstat 0.19938790 -0.23811448 0.41562347
## medv 0.19314756 -0.46312460 -0.11590396
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9479 0.0384 0.0137
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red",
                       tex = 0.75, choices = c(1, 2)) {
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0,
         x1 = myscale * heads[, choices[1]],
         y1 = myscale * heads[, choices[2]],
         col = color, length = arrow_heads)
  text(myscale * heads[, choices], labels = row.names(heads),
       cex = tex, col = color, pos = 3)
}
classes <- as.numeric(train$crime)
# fixed this
plot(lda.fit, dimen = 2, col=classes, pch=classes)
lda.arrows(lda.fit, myscale = 3)
The LDA model fitted in the previous step is used to predict the crime class for the test data, after the true class values are removed from it. The predicted classes are then compared against the known true values in the cross-tabulation below. On the test data the class high is predicted perfectly (25/25). Class med_low has the worst prediction accuracy: almost half of its observations (14 of 30) are predicted into other classes. Class low is not much better, with 12 of 27 observations misclassified. Class med_high fares best after high, with 14 of 20 predicted correctly.
## predicted
## correct low med_low med_high high
## low 15 12 0 0
## med_low 6 16 8 0
## med_high 1 5 14 0
## high 0 0 0 25
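The prediction step might look like the following self-contained sketch, which rebuilds the pipeline up to this point (the seed is my addition, so the exact counts will differ from the table above):

```r
library(MASS)

# Rebuild the scaled data with the crime factor and an 80/20 split.
boston_scaled <- as.data.frame(scale(Boston))
boston_scaled$crime <- cut(boston_scaled$crim,
                           breaks = quantile(boston_scaled$crim),
                           include.lowest = TRUE,
                           labels = c("low", "med_low", "med_high", "high"))
boston_scaled$crim <- NULL
set.seed(123)
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]
lda.fit <- lda(crime ~ ., data = train)

# Keep the true classes aside and drop them from the test set before predicting.
correct_classes <- test$crime
test$crime <- NULL

# Predict with the fitted model and cross-tabulate against the truth.
lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
```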
First the Boston dataset is reloaded and scaled again.
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num -0.419 -0.417 -0.417 -0.416 -0.412 ...
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
Then the Euclidean distances between the scaled observations are computed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
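The distance computation can be sketched as follows (note that `dist` uses the Euclidean method by default):

```r
library(MASS)

# Scale the reloaded Boston data and compute all pairwise distances.
boston_scaled <- as.data.frame(scale(Boston))
dist_eu <- dist(boston_scaled)  # method = "euclidean" is the default
summary(dist_eu)
```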
A simple initial run of the K-means algorithm with a randomly chosen number of cluster centers (5). The pairs plot is hard to read when all variables are compared with each other. Splitting the data for the pairs plot makes it more readable, but then you miss the comparisons between variables in different halves.
To determine the optimal number of clusters, the total WCSS (within-cluster sum of squares) can be used for guidance. Here I use 14 as the maximum number of clusters, simply based on the number of variables in the dataset, and quickly plot the tWCSS results. The optimal number of clusters is where the tWCSS drops dramatically, and here we get the same result (2 clusters) as in the DataCamp exercise.
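The elbow search just described can be sketched like this (the seed is my addition; `kmeans` uses random starting centers):

```r
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))

# Total within-cluster sum of squares for k = 1..14 clusters.
set.seed(123)
k_max <- 14
twcss <- sapply(1:k_max,
                function(k) kmeans(boston_scaled, centers = k)$tot.withinss)

# The "elbow", where tWCSS drops sharply, suggests the number of clusters.
plot(1:k_max, twcss, type = "b",
     xlab = "number of clusters", ylab = "total WCSS")
```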
Running the K-means algorithm again, now with 2 clusters as per the optimal suggestion. Plotting the results with pairs shows all variable pairs split into the two clusters.
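A minimal sketch of this final run (seed and the split of the variables into two halves for readability are my choices):

```r
library(MASS)

boston_scaled <- as.data.frame(scale(Boston))

# K-means with the suggested 2 clusters.
set.seed(123)
km <- kmeans(boston_scaled, centers = 2)

# Colour the pairs plots by cluster assignment; plotting the variables
# in two halves keeps each matrix readable.
pairs(boston_scaled[, 1:7], col = km$cluster)
pairs(boston_scaled[, 8:14], col = km$cluster)
```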
I found the instructions for this bonus exercise a little ambiguous, so I took some liberties, since it is not clearly stated how some of the things should be done.
So, first I took the Boston dataset and scaled it. I did nothing to crim, no categorization.
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Then I ran k-means on the scaled data with 3 clusters specified and show the pairs plot. I took the cluster vector from the results and added it as a new column to the original scaled data.
## 'data.frame': 506 obs. of 15 variables:
## $ crim : num -0.419 -0.417 -0.417 -0.416 -0.412 ...
## $ zn : num 0.285 -0.487 -0.487 -0.487 -0.487 ...
## $ indus : num -1.287 -0.593 -0.593 -1.306 -1.306 ...
## $ chas : num -0.272 -0.272 -0.272 -0.272 -0.272 ...
## $ nox : num -0.144 -0.74 -0.74 -0.834 -0.834 ...
## $ rm : num 0.413 0.194 1.281 1.015 1.227 ...
## $ age : num -0.12 0.367 -0.266 -0.809 -0.511 ...
## $ dis : num 0.14 0.557 0.557 1.077 1.077 ...
## $ rad : num -0.982 -0.867 -0.867 -0.752 -0.752 ...
## $ tax : num -0.666 -0.986 -0.986 -1.105 -1.105 ...
## $ ptratio: num -1.458 -0.303 -0.303 0.113 0.113 ...
## $ black : num 0.441 0.441 0.396 0.416 0.441 ...
## $ lstat : num -1.074 -0.492 -1.208 -1.36 -1.025 ...
## $ medv : num 0.16 -0.101 1.323 1.182 1.486 ...
## $ cluster: int 1 1 1 1 1 1 1 1 1 1 ...
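The clustering step just described, as a sketch (seed and object names are my additions):

```r
library(MASS)

# Scale the data and run k-means with 3 clusters.
boston_scaled <- as.data.frame(scale(Boston))
set.seed(123)
km <- kmeans(boston_scaled, centers = 3)
pairs(boston_scaled[, 1:7], col = km$cluster)

# Attach the cluster vector as a new column, giving the 15-variable
# data frame shown above.
boston_scaled$cluster <- km$cluster
str(boston_scaled)
```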
I did the same split into training and test sets as we did here earlier, so that I can see how well the model predicts. The most influential linear separators seem to be medv and chas.
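These steps can be sketched as a self-contained pipeline (the seed is my addition, so the exact numbers in the output below will differ between runs):

```r
library(MASS)

# Rebuild the clustered data: scale Boston, run k-means with 3 clusters,
# and attach the cluster vector as the new target variable.
boston_scaled <- as.data.frame(scale(Boston))
set.seed(123)
boston_scaled$cluster <- kmeans(boston_scaled, centers = 3)$cluster

# 80/20 split as before.
ind <- sample(nrow(boston_scaled), size = floor(0.8 * nrow(boston_scaled)))
train <- boston_scaled[ind, ]
test <- boston_scaled[-ind, ]

# Fit LDA with the k-means cluster as the target.
lda.fit <- lda(cluster ~ ., data = train)
lda.fit

# Predict clusters for the test set and cross-tabulate.
correct <- test$cluster
pred <- predict(lda.fit, newdata = subset(test, select = -cluster))
table(correct = correct, predicted = pred$class)
```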
## Call:
## lda(cluster ~ ., data = train)
##
## Prior probabilities of groups:
## 1 2 3
## 0.5148515 0.3391089 0.1460396
##
## Group means:
## crim zn indus chas nox rm
## 1 -0.3941546 0.31227411 -0.6370635 -0.27232907 -0.65912810 -0.01684603
## 2 0.7549052 -0.48724019 1.1552802 -0.09990132 1.11285090 -0.46105942
## 3 -0.3149865 0.02365179 -0.3280701 1.39593374 -0.09317372 1.15993858
## age dis rad tax ptratio black
## 1 -0.6086596 0.6311264 -0.6003372 -0.6010240 -0.07361837 0.3163746
## 2 0.8059442 -0.8487400 1.0652503 1.1624509 0.51852048 -0.5592621
## 3 0.2613110 -0.3401229 -0.3667601 -0.5552257 -0.97842899 0.2623650
## lstat medv
## 1 -0.3954831 0.09021249
## 2 0.9137863 -0.77094599
## 3 -0.6633176 1.38467393
##
## Coefficients of linear discriminants:
## LD1 LD2
## crim 0.01117325 0.09031124
## zn 0.42825390 -0.07804598
## indus 1.27189292 0.21207572
## chas -0.29464473 0.82381171
## nox 0.81644143 -0.31106867
## rm 0.14381601 0.56471935
## age -0.11369692 0.51461099
## dis 0.12926754 -0.35989536
## rad 0.70042482 0.35646727
## tax 0.31049220 -0.22916058
## ptratio 0.06106314 -0.40680906
## black -0.04908228 -0.05389697
## lstat 0.32333200 0.40960024
## medv -0.16431904 0.87258292
##
## Proportion of trace:
## LD1 LD2
## 0.7556 0.2444
## predicted
## correct 1 2 3
## 1 57 1 0
## 2 0 33 0
## 3 2 1 8
And we can see that the prediction is quite precise. It is less clear, though, what these 3 classes actually represent with this data (whereas above we were predicting crime rate classes).
In this part two datasets needed to be read from the internet. Then the column names in both datasets were renamed, 2 new variables created, and the data combined into one dataset and stored. I had some problems when reading the stored data back: I did not get the same number of observations as in the file (I got 164 instead of the 195 in the file). I did not see anything obviously wrong in the file, but after I turned quote back to its default value (TRUE) when saving the data, the correct number of observations could be read.
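The author's actual data is not shown here, but as a hypothetical illustration of the quoting issue: strings containing an apostrophe (common in country names) can confuse `read.table` when the file was written without quotes, because the apostrophe is treated as an opening quote. Writing with the default `quote = TRUE` avoids this. The example data below is made up:

```r
# A made-up data frame with an apostrophe in a string value.
df <- data.frame(country = c("Cote d'Ivoire", "Finland"), value = 1:2)

f <- tempfile(fileext = ".txt")

# With the default quote = TRUE, strings are wrapped in double quotes,
# so the embedded apostrophe is harmless on re-read.
write.table(df, file = f, sep = "\t", quote = TRUE)
back <- read.table(f, sep = "\t", header = TRUE)
nrow(back)  # 2: all observations survive the round trip

# With quote = FALSE the unquoted apostrophe can swallow following
# lines on read, dropping or merging observations.
```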
The code for the data wrangling can be found from here
And the data can be found from here